
4 Activation Functions

“If you know what you’re doing, it’s not research.” - Albert Einstein

In the history of deep learning, activation functions and optimization techniques have made significant progress. When the McCulloch-Pitts artificial neuron model first appeared in 1943, it used only a simple threshold function (step function). This mimicked the biological neuron’s behavior, where the neuron is activated only when the input exceeds a certain threshold. However, such simple forms of activation functions limited the ability of neural networks to express complex functions.

From the 1980s through the 2000s, machine learning focused on feature engineering and sophisticated algorithm design. Neural networks were just one of many machine learning algorithms, and methods such as SVMs (Support Vector Machines) and Random Forests often performed better. For example, on the MNIST handwritten digit recognition benchmark, SVMs remained competitive with neural networks for many years.

In 2012, AlexNet achieved overwhelming performance in the ImageNet challenge using efficient learning with GPUs, marking the beginning of the deep learning era. In 2017, Google’s Transformer architecture further advanced this innovation, becoming the basis for today’s large language models (LLMs) like GPT-4 and Gemini.

At the center of these advances were the evolution of activation functions and the development of optimization techniques. In this chapter, we will delve into activation functions in detail, providing the theoretical foundation you need to develop new models and solve complex problems.

4.1 Activation Functions: Introducing Non-Linearity into Neural Networks

Researcher’s Dilemma: Early neural network researchers realized that linear transformations alone could not solve complex problems. However, it was unclear which non-linear function would allow the neural network to learn effectively and solve various problems. Should they mimic the behavior of biological neurons or use other functions with better mathematical and computational properties?

Activation functions are the key elements that introduce non-linearity between neural network layers. The Universal Approximation Theorem (1988) mentioned in Section 1.4.1 proved that a neural network with a single hidden layer and a non-linear activation function can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units. In other words, activation functions allow neural networks to transcend the limitations of simple linear models and act as universal function approximators by introducing non-linearity between layers.
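As a minimal illustration of this idea (a standalone sketch, not part of the chapter's library code), a single hidden layer with a Tanh non-linearity can be trained to approximate \(\sin(x)\):

Code
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-math.pi, math.pi, 256).unsqueeze(1)
y = torch.sin(x)

# One hidden layer + Tanh non-linearity
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    loss = loss_fn(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.6f}")  # small: the network fits sin(x) closely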

4.1.1 Why Activation Functions are Needed: Overcoming the Limitations of Linearity

Without activation functions, no matter how many layers are stacked, the neural network would ultimately be equivalent to a linear transformation. This can be simply proven as follows:

Consider applying two linear transformations in sequence:

  • First layer: \(y_1 = W_1x + b_1\)
  • Second layer: \(y_2 = W_2y_1 + b_2\)

where \(x\) is the input, \(W_1\) and \(W_2\) are weight matrices, and \(b_1\) and \(b_2\) are bias vectors. By substituting the first layer’s equation into the second layer’s equation:

\(y_2 = W_2(W_1x + b_1) + b_2 = (W_2W_1)x + (W_2b_1 + b_2)\)

Defining a new weight matrix \(W' = W_2W_1\) and a new bias vector \(b' = W_2b_1 + b_2\):

\(y_2 = W'x + b'\)
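Before stating the conclusion, a quick numerical check (a standalone sketch with arbitrarily chosen layer sizes) confirms that two stacked nn.Linear layers are reproduced exactly by a single nn.Linear layer built from \(W'\) and \(b'\):

Code
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers (sizes chosen arbitrarily for illustration)
layer1 = nn.Linear(4, 3)   # y1 = W1 x + b1
layer2 = nn.Linear(3, 2)   # y2 = W2 y1 + b2

# Merge into a single layer: W' = W2 W1, b' = W2 b1 + b2
merged = nn.Linear(4, 2)
with torch.no_grad():
    merged.weight.copy_(layer2.weight @ layer1.weight)
    merged.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)
print(torch.allclose(layer2(layer1(x)), merged(x), atol=1e-6))  # True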

This is equivalent to a single linear transformation. The same applies no matter how many layers are stacked. Ultimately, linear transformations alone cannot express complex non-linear relationships.

4.1.2 Evolution of Activation Functions: From Biological Inspiration to Efficient Computation

  • 1943, McCulloch-Pitts Neuron: The first artificial neuron model used a simple threshold function, or step function, which mimicked the biological neuron’s behavior of only activating when the input exceeded a certain threshold.

    \[ f(x) = \begin{cases} 1, & \text{if } x \ge \theta \\ 0, & \text{if } x < \theta \end{cases} \]

    Here, \(\theta\) is the threshold.

  • 1960s, Sigmoid Function: The sigmoid function was introduced to model the firing rate of biological neurons more smoothly. The sigmoid function is an S-shaped curve that compresses the input value into a value between 0 and 1.

    \[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

    The sigmoid function has the advantage of being differentiable, allowing it to be used with gradient descent-based learning algorithms. However, the sigmoid function was also identified as one of the causes of the vanishing gradient problem in deep neural networks. When the input value is very large or small, the gradient (derivative) of the sigmoid function approaches 0, causing learning to slow down or stop.

  • 2010, ReLU (Rectified Linear Unit): Nair and Hinton proposed the ReLU function, opening a new era in deep neural network learning. ReLU has a very simple form.

    \[ ReLU(x) = \max(0, x) \]

    ReLU passes the input through unchanged when it is greater than 0 and outputs 0 otherwise. Unlike the sigmoid function, ReLU suffers far less from the vanishing gradient problem and is computationally cheaper. Thanks to these advantages, ReLU contributed greatly to the success of deep neural networks and remains one of the most widely used activation functions. A short numerical comparison of the two functions' gradients follows below.
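The saturation of the sigmoid and the constant positive-side gradient of ReLU can be checked directly with autograd (a minimal sketch, independent of the chapter's library):

Code
import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)

# Sigmoid: the gradient shrinks toward 0 for large |x| (saturation)
torch.sigmoid(x).sum().backward()
print("sigmoid grad:", x.grad)   # roughly [4.5e-05, 0.1966, 0.25, 0.1966, 4.5e-05]

# ReLU: the gradient is exactly 1 for positive inputs and 0 elsewhere
x.grad = None
torch.relu(x).sum().backward()
print("relu grad:   ", x.grad)   # [0., 0., 0., 1., 1.]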

4.1.3 Choosing Activation Functions: Model Size, Task, and Efficiency

The choice of activation function has a significant impact on the performance and efficiency of the model.

  • Large Language Models (LLMs): Since computational efficiency is crucial, simple activation functions tend to be preferred. Recent base models such as Llama 3, GPT-4, and Gemini rely on simple, efficient activation functions such as GELU (Gaussian Error Linear Unit), ReLU, or gated SiLU/Swish variants (e.g., SwiGLU in the Llama family). In particular, Gemini 1.5 adopts the MoE (Mixture of Experts) architecture, in which each expert network can use its own optimized activation function.

  • Specialized Models: When developing models optimized for specific tasks, more sophisticated approaches are being attempted. For example, in recent research like TEAL, methods to improve inference speed by up to 1.8 times through activation sparsity have been proposed. Additionally, studies on using adaptive activation functions that dynamically adjust their behavior based on the input data are also underway.

The choice of activation function should be made considering the model size, task characteristics, available computational resources, and required performance characteristics (accuracy, speed, memory usage, etc.).

4.2 Comparison of Activation Functions

Challenge: Among numerous activation functions, which one is most suitable for a specific problem and architecture?

Researcher’s Dilemma: As of 2025, over 500 activation functions have been proposed, but there is no single perfect activation function for all situations. Researchers must understand the characteristics of each function and consider the problem’s characteristics, model architecture, computational resources, and more to select the optimal activation function or even develop a new one.

The properties generally required for an activation function are as follows:

  1. It must add non-linearity to the neural network.
  2. It should not increase computational complexity to the point of making training difficult.
  3. It must be differentiable so as not to hinder gradient flow.
  4. The data distribution at each layer of the neural network should remain appropriate during training.

Many efficient activation functions that meet these requirements have been proposed. It’s hard to say which activation function is the best, as it depends on the model being trained and the data. The way to find the optimal activation function is through actual testing.

As of 2025, activation functions can be broadly classified into three categories:

  1. Classical activation functions: Sigmoid, Tanh, ReLU, etc., which have a fixed shape.
  2. Adaptive activation functions: PReLU, TeLU, STAF, etc., which include parameters that adjust their shape during training.
  3. Specialized activation functions: ENN (Expressive Neural Network), physics-informed activation functions, etc., which are optimized for specific domains.

This chapter compares several activation functions, primarily focusing on those implemented in PyTorch, but also implementing others like Swish and STAF by inheriting from nn.Module. The full implementation can be found in chapter_04/models/activations.py.

4.2.1 Creating Activation Functions

Code
!pip install dldna[colab] # in Colab
# !pip install dldna[all] # in your local

%load_ext autoreload
%autoreload 2
Code
import torch
import torch.nn as nn
import numpy as np

# Set seed
np.random.seed(7)
torch.manual_seed(7)

# STAF (Sinusoidal Trainable Activation Function)
class STAF(nn.Module):
    def __init__(self, tau=25):
        super().__init__()
        self.tau = tau
        self.C = nn.Parameter(torch.randn(tau))
        self.Omega = nn.Parameter(torch.randn(tau))
        self.Phi = nn.Parameter(torch.randn(tau))

    def forward(self, x):
        result = torch.zeros_like(x)
        for i in range(self.tau):
            result += self.C[i] * torch.sin(self.Omega[i] * x + self.Phi[i])
        return result

# TeLU (Trainable exponential Linear Unit)
class TeLU(nn.Module):
    def __init__(self, alpha=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, x):
        return torch.where(x > 0, x, self.alpha * (torch.exp(x) - 1))

# Swish (Custom Implementation)
class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

# Activation function dictionary
act_functions = {
    # Classic activation functions
    "Sigmoid": nn.Sigmoid,     # Binary classification output layer
    "Tanh": nn.Tanh,          # RNN/LSTM

    # Modern basic activation functions
    "ReLU": nn.ReLU,          # CNN default
    "GELU": nn.GELU,          # Transformer standard
    "Mish": nn.Mish,          # Performance/stability balance

    # ReLU variants
    "LeakyReLU": nn.LeakyReLU,# Handles negative inputs
    "SiLU": nn.SiLU,          # Efficient sigmoid
    "Hardswish": nn.Hardswish,# Mobile optimized
    "Swish": Swish,           # Custom implementation

    # Adaptive/trainable activation functions
    "PReLU": nn.PReLU,        # Trainable slope
    "RReLU": nn.RReLU,        # Randomized slope
    "TeLU": TeLU,             # Trainable exponential
    "STAF": STAF             # Fourier-based
}

STAF, introduced at ICLR 2025, is an activation function built from a Fourier series with learnable coefficients. ENN improves the network's representational power by using the discrete cosine transform (DCT). TeLU extends ELU by making the alpha parameter learnable.
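Each entry in act_functions stores a class rather than an instance, so it must be instantiated before use. A quick check (a minimal sketch):

Code
x = torch.linspace(-3, 3, 7)
for name in ["ReLU", "GELU", "STAF"]:
    act = act_functions[name]()           # instantiate the stored class
    print(name, act(x).detach())          # detach: STAF has trainable parameters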

4.2.2 Visualization of Activation Functions

The activation functions and their gradients are visualized to compare their characteristics. Using PyTorch's automatic differentiation, gradients can be computed simply by calling backward(). The following is an example of visually analyzing the characteristics of activation functions. The gradient flow is computed by passing a fixed range of input values through a given activation function; the compute_gradient_flow function plays this role.

Code
def compute_gradient_flow(activation, x_range=(-5, 5), y_range=(-5, 5), points=100):
    """
    Computes the 3D gradient flow.

    Calculates the output surface of the activation function for two-dimensional
    inputs and the magnitude of the gradient with respect to those inputs.

    Args:
        activation: Activation function (nn.Module or function).
        x_range (tuple): Range for the x-axis (default: (-5, 5)).
        y_range (tuple): Range for the y-axis (default: (-5, 5)).
        points (int): Number of points to use for each axis (default: 100).

    Returns:
        X, Y (ndarray): Meshgrid coordinates.
        Z (ndarray): Activation function output values.
        grad_magnitude (ndarray): Gradient magnitude at each point.
    """
    x = np.linspace(x_range[0], x_range[1], points)
    y = np.linspace(y_range[0], y_range[1], points)
    X, Y = np.meshgrid(x, y)

    # Stack the two dimensions to create a 2D input tensor (first row: X, second row: Y)
    input_tensor = torch.tensor(np.stack([X, Y], axis=0), dtype=torch.float32, requires_grad=True)

    # Construct the surface as the sum of the activation function outputs for the two inputs
    Z = activation(input_tensor[0]) + activation(input_tensor[1])
    Z.sum().backward()

    grad_x = input_tensor.grad[0].numpy()
    grad_y = input_tensor.grad[1].numpy()
    grad_magnitude = np.sqrt(grad_x**2 + grad_y**2)

    # Return the values promised in the docstring
    return X, Y, Z.detach().numpy(), grad_magnitude
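
For example (a minimal usage sketch, independent of the library's own plotting helpers), the returned arrays can be plotted directly with matplotlib:

Code
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- needed on older matplotlib versions

X, Y, Z, grad_mag = compute_gradient_flow(nn.ReLU())

fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(X, Y, Z, cmap="viridis")   # output surface
ax1.set_title("ReLU output surface")

ax2 = fig.add_subplot(1, 2, 2)
im = ax2.imshow(grad_mag, extent=(-5, 5, -5, 5), origin="lower")  # gradient magnitude heatmap
ax2.set_title("Gradient magnitude")
fig.colorbar(im, ax=ax2)
plt.show()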

The following cell performs 3D visualization for all of the defined activation functions.

Code
from dldna.chapter_04.visualization.activations import visualize_all_activations

visualize_all_activations()

The graph represents the output value (Z-axis) and gradient magnitude (heatmap) for two inputs (X-axis, Y-axis).

  1. Sigmoid: It has an “S” shape. Both ends flatten out as they converge to 0 and 1, while the middle is steep. It compresses the input into the range 0 to 1. The gradient approaches 0 at both ends and is largest in the middle, which can cause the vanishing gradient problem and slow down learning for very large or very small inputs.

  2. ReLU: The surface is piecewise planar. Where both inputs are negative it is flat at 0; along each positive input it rises linearly. The gradient is 0 for negative inputs and constant for positive ones. Since there is no vanishing gradient for positive inputs, it is efficient and widely used.

  3. GELU: Similar to ReLU but smoother. The surface dips slightly below zero for small negative inputs and rises almost linearly on the positive side. The gradient changes gradually, with no interval where it is exactly 0, and does not vanish completely even for small negative inputs, which helps learning. It is used in newer models such as transformers.

  4. STAF: Wave-shaped, based on the sine function, with learnable parameters to adjust amplitude, frequency, and phase. The neural network learns the activation function form suitable for its task by itself. The gradient changes complexly. Favorable for learning non-linear relationships.

The 3D graph (Surface) represents the output value of the activation function for two inputs added together and displayed on the Z-axis. The heatmap (Gradient Magnitude) shows the size of the gradient, i.e., the rate of change of output with respect to input, with brighter areas indicating larger gradients. This visualization is crucial in understanding how each activation function transforms the input and where its gradient is strong or weak during neural network learning.

4.2.3 Comparison Table of Activation Functions

Activation functions are key elements that provide non-linearity to neural networks, and their characteristics are well-represented in gradient forms. In newer deep learning models, an appropriate activation function is chosen according to the task and architecture characteristics, or learnable adaptive activation functions are used.

Comparison Summary of Activation Functions

| Category | Activation Function | Characteristics | Primary Use | Advantages and Disadvantages |
| --- | --- | --- | --- | --- |
| Classical | Sigmoid | Normalizes output to 0~1, capturing continuous changes with a smooth gradient | Binary classification output layer | May cause vanishing gradients in deep neural networks |
| Classical | Tanh | Similar to sigmoid, but output is -1~1 with a steeper gradient near 0, making learning effective | RNN/LSTM gates | Zero-centered output is advantageous for learning, but still prone to vanishing gradients |
| Modern Basic | ReLU | Simple structure with a gradient of 0 for x < 0 and 1 for x > 0, useful for boundary detection | CNN default | Extremely efficient computation, but neurons are completely deactivated for negative inputs |
| Modern Basic | GELU | Combines ReLU-like behavior with the Gaussian cumulative distribution function, providing smooth non-linearity | Transformer | Natural regularization effect, but higher computational cost than ReLU |
| Modern Basic | Mish | Smooth gradient and self-normalization characteristics, stable performance across tasks | General purpose | Good balance between performance and stability, but increased computational complexity |
| ReLU Variant | LeakyReLU | Allows a small slope for negative inputs, reducing information loss | CNN | Mitigates the dead-neuron problem, but the slope must be set manually |
| ReLU Variant | Hardswish | Designed as a computationally efficient variant for mobile networks | Mobile networks | Efficient due to its lightweight structure, but expressiveness is somewhat limited |
| ReLU Variant | Swish | x multiplied by sigmoid(x), giving a smooth gradient and a mild gating effect | Deep networks | Stable learning due to soft boundaries, but increased computational cost |
| Adaptive | PReLU | Learns the slope in the negative region, finding the optimal shape for the data | CNN | Adapts to the data, but additional parameters increase overfitting risk |
| Adaptive | RReLU | Uses a random negative-region slope during training to prevent overfitting | General purpose | Regularization effect, but results may lack reproducibility |
| Adaptive | TeLU | Learns the scale of the exponential part, enhancing ELU's advantages and adapting to the data | General purpose | Enhances ELU's advantages, but convergence may be unstable |
| Adaptive | STAF | Based on a Fourier series, learning complex non-linear patterns with high expressive power | Complex patterns | Highly expressive, but high computational cost and memory usage |
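
To get a feel for the computational-cost differences noted in the table, a rough timing sketch such as the following can be run (results depend heavily on hardware and are for illustration only):

Code
import time
import torch
import torch.nn as nn

x = torch.randn(4096, 1024)
for name, act in [("ReLU", nn.ReLU()), ("GELU", nn.GELU()), ("Mish", nn.Mish())]:
    start = time.perf_counter()
    for _ in range(100):
        act(x)                      # forward pass only
    print(f"{name}: {time.perf_counter() - start:.3f} s")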

4.3 Visualizing the Impact of Activation Functions in Neural Networks

Let’s analyze the impact of activation functions on the learning process of neural networks using the FashionMNIST dataset. Since the backpropagation algorithm was re-popularized in 1986, the choice of activation function has been one of the most important factors in neural network design. The role of activation functions is especially crucial in deep networks, where the vanishing/exploding gradient problem must be managed. Recently, self-adaptive activation functions and activation-function selection through Neural Architecture Search (NAS) have gained attention, and data-dependent (gated) activations are increasingly common in transformer-based models.

For experimentation, we use a simple classification model called SimpleNetwork. This model converts 28x28 images into 784-dimensional vectors, passes them through configurable hidden layers, and classifies them into 10 classes. To clearly see the impact of activation functions, we compare models with and without activation functions.

Code
import torch.nn as nn
from torchinfo import summary
from dldna.chapter_04.models.base import SimpleNetwork
from dldna.chapter_04.utils.data import get_device

device = get_device()

model_relu = SimpleNetwork(act_func=nn.ReLU()).to(device)  # Declare a ReLU network for testing.
model_no_act = SimpleNetwork(act_func=nn.ReLU(), no_act = True).to(device)  # Build a network without activation functions.

summary(model_relu, input_size=[1, 784])
summary(model_no_act, input_size=[1, 784])
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
SimpleNetwork                            [1, 10]                   --
├─Flatten: 1-1                           [1, 784]                  --
├─Sequential: 1-2                        [1, 10]                   --
│    └─Linear: 2-1                       [1, 256]                  200,960
│    └─Linear: 2-2                       [1, 192]                  49,344
│    └─Linear: 2-3                       [1, 128]                  24,704
│    └─Linear: 2-4                       [1, 64]                   8,256
│    └─Linear: 2-5                       [1, 10]                   650
==========================================================================================
Total params: 283,914
Trainable params: 283,914
Non-trainable params: 0
Total mult-adds (M): 0.28
==========================================================================================
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 1.14
Estimated Total Size (MB): 1.14
==========================================================================================

Load and preprocess the dataset.

Code
from torchinfo import summary
from dldna.chapter_04.utils.data import get_data_loaders

train_dataloader, test_dataloader  = get_data_loaders()

train_dataloader
<torch.utils.data.dataloader.DataLoader at 0x72be38d40700>

Gradient flow is at the core of neural network learning. As layers get deeper, gradients are repeatedly multiplied according to the chain rule, which can lead to vanishing or exploding gradients. For example, in a 30-layer network the gradient passes through 30 multiplications before it reaches the input layer. The activation function introduces non-linearity between layers and regulates how gradients flow through this chain; the toy calculation below shows how quickly repeated multiplication shrinks or inflates a gradient.
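
The following toy calculation (a sketch, not tied to any real network) multiplies a unit gradient by a constant per-layer factor 30 times:

Code
import torch

depth = 30
for factor in (0.9, 1.1):
    grad = torch.tensor(1.0)
    for _ in range(depth):
        grad = grad * factor          # one chain-rule multiplication per layer
    print(f"factor={factor}: gradient after {depth} layers = {grad.item():.4f}")
# factor=0.9 shrinks toward 0 (vanishing); factor=1.1 grows to about 17.4 (exploding)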

The following code visualizes the gradient distribution of a model using the ReLU activation function.

Code
from dldna.chapter_04.visualization.gradients import visualize_network_gradients

visualize_network_gradients()

You can analyze the characteristics of an activation function by visualizing each layer's gradient distribution as a histogram. For ReLU, the gradients at the output layer are on the order of \(10^{-2}\), while those at the input layer are on the order of \(10^{-3}\). PyTorch uses He (Kaiming) initialization by default, which is well suited to ReLU-family activation functions. Other initialization methods such as Xavier and Orthogonal are also available; these are covered in detail in the initialization section.

Code
from dldna.chapter_04.models.activations import act_functions
from dldna.chapter_04.visualization.gradients import get_gradients_weights, visualize_distribution

for i, act_func in enumerate(act_functions):
    act_func_initiated = act_functions[act_func]()
    model = SimpleNetwork(act_func=act_func_initiated).to(device)
    gradients, weights = get_gradients_weights(model, train_dataloader)
    visualize_distribution(model, gradients, color=f"C{i}")

Looking at the gradient distribution for each activation function, Sigmoid shows very small values on the order of \(10^{-5}\) at the input layer, which indicates that the vanishing gradient problem can occur. ReLU's gradients are concentrated around 0, a consequence of neurons being deactivated (dead neurons) for negative inputs. Recent adaptive activation functions alleviate these problems while preserving non-linearity. For example, GELU shows a gradient distribution close to a normal distribution, which works well in combination with batch normalization. Let's compare this with the case without an activation function.

Code
from dldna.chapter_04.models.base import SimpleNetwork

model_no_act = SimpleNetwork(act_func=nn.ReLU(), no_act = True).to(device) 

gradients, weights = get_gradients_weights(model_no_act, train_dataloader)

visualize_distribution(model_no_act, gradients, title="gradients")

Without an activation function, the distributions across layers have similar shapes and differ only in scale. This shows that, with no non-linearity, the feature transformation between layers is limited.

4.4 Model Training

To objectively compare the performance of activation functions, experiments are conducted using the FashionMNIST dataset. As of 2025, there are over 500 activation functions, but in actual deep learning projects, a small number of validated activation functions are mainly used. First, let’s take a look at the basic training process based on ReLU.

4.4.1 Single Model Training

Code
import torch.optim as optim
from dldna.chapter_04.experiments.model_training import train_model
from dldna.chapter_04.models.base import SimpleNetwork
from dldna.chapter_04.utils.data import get_device
from dldna.chapter_04.visualization.training import plot_results

model = SimpleNetwork(act_func=nn.ReLU()).to(device)
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
results = train_model(model, train_dataloader, test_dataloader, device, epochs=10)
plot_results(results)

Starting training for SimpleNetwork-ReLU.
Execution completed for SimpleNetwork-ReLU, Execution time = 76.1 secs

4.4.2 Model Training Based on Activation Functions

Now we conduct comparative experiments on the major activation functions. The composition and training conditions of each model are kept identical to ensure a fair comparison:

  • 4 hidden layers [256, 192, 128, 64]
  • SGD optimizer (learning rate=1e-3, momentum=0.9)
  • Batch size 128
  • Trained for 15 epochs

Code
from dldna.chapter_04.experiments.model_training import train_all_models
from dldna.chapter_04.visualization.training import create_results_table

# Train only selected models
# selected_acts = ["ReLU"]  # Select only the desired activation functions
selected_acts = ["Tanh", "ReLU", "Swish"]
# selected_acts = ["Sigmoid", "ReLU", "Swish", "PReLU", "TeLU", "STAF"]
# selected_acts = ["Sigmoid", "Tanh", "ReLU", "GELU", "Mish", "LeakyReLU", "SiLU", "Hardswish", "Swish", "PReLU", "RReLU", "TeLU", "STAF"]
# results_dict = train_all_models(act_functions, train_dataloader, test_dataloader,
#                               device, epochs=15, selected_acts=selected_acts)
results_dict = train_all_models(act_functions, train_dataloader, test_dataloader,
                              device, epochs=15, selected_acts=selected_acts, save_epochs=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])

create_results_table(results_dict)

The results came out as shown in the table below. The values will vary depending on each execution environment.

| Model | Accuracy (%) | Final Loss | Time (sec) |
| --- | --- | --- | --- |
| SimpleNetwork-Sigmoid | 10.0 | 2.30 | 115.6 |
| SimpleNetwork-Tanh | 82.3 | 0.50 | 114.3 |
| SimpleNetwork-ReLU | 81.3 | 0.52 | 115.2 |
| SimpleNetwork-GELU | 80.5 | 0.54 | 115.2 |
| SimpleNetwork-Mish | 81.9 | 0.51 | 113.4 |
| SimpleNetwork-LeakyReLU | 80.8 | 0.55 | 114.4 |
| SimpleNetwork-SiLU | 78.3 | 0.59 | 114.3 |
| SimpleNetwork-Hardswish | 76.7 | 0.64 | 114.5 |
| SimpleNetwork-Swish | 78.5 | 0.59 | 116.1 |
| SimpleNetwork-PReLU | 86.0 | 0.40 | 114.9 |
| SimpleNetwork-RReLU | 81.5 | 0.52 | 114.6 |
| SimpleNetwork-TeLU | 86.2 | 0.39 | 119.6 |
| SimpleNetwork-STAF | 85.4 | 0.44 | 270.2 |

Analyzing the experimental results, we can see the following:

  1. Computational Efficiency: Tanh, ReLU, etc. are the fastest, while STAF is relatively slow due to complex calculations.

  2. Accuracy:

    • Adaptive activation functions (TeLU 86.2%, PReLU 86.0%, STAF 85.4%) show overall superior performance.
    • Classical Sigmoid has very low performance (10.0%) due to the gradient vanishing problem.
    • Modern basic activation functions (ReLU, GELU, Mish) show stable performance in the range of 80-82%.
  3. Stability:

    • Tanh, ReLU, and Mish show relatively stable learning curves.
    • Adaptive activation functions show high performance but have more variability during the learning process.

These results are comparisons under specific conditions, so when selecting an activation function for an actual project, consider the following factors:

  1. Compatibility with the model architecture (e.g., GELU is recommended for transformers)
  2. Constraints on computational resources (consider Hardswish in mobile environments)
  3. Characteristics of the task (Tanh is still useful for time-series prediction)
  4. Model size and dataset characteristics

As of 2025, it is common to use GELU in large language models for computational efficiency, ReLU-family functions in computer vision, and adaptive activation functions in reinforcement learning.

4.5 Trained Model’s Layer-wise Output and Dead Neuron Analysis

Previously, we examined the distribution of gradient values for each layer in the backpropagation of the initial model. Now, let’s look at what values each layer outputs in the forward calculation using the trained model. Analyzing the output of each layer of the trained model is crucial for understanding the representational power and learning patterns of neural networks. Since the introduction of ReLU in 2010, the problem of dead neurons has become a major consideration in deep neural network design.

First, we visualize the distribution of outputs for each layer in the forward calculation of the trained model.

4.5.1 Layer-wise Output Distribution Visualization

Code
import os
from dldna.chapter_04.utils.metrics import load_model
from dldna.chapter_04.utils.data import get_data_loaders, get_device
from dldna.chapter_04.visualization.gradients import get_model_outputs, visualize_distribution


device = get_device()
# Re-define the data loaders.
train_dataloader, test_dataloader = get_data_loaders()

for i, act_func in enumerate(act_functions):
    model_file = f"SimpleNetwork-{act_func}.pth"
    model_path = os.path.join("./tmp/models", model_file)
    
    # Load the model only if the file exists
    if os.path.exists(model_path):
        # Load the model.
        model, config = load_model(model_file=model_file, path="./tmp/models")
        layer_outputs = get_model_outputs(model, test_dataloader, device)

        visualize_distribution(model, layer_outputs, title="gradients", color=f"C{i}")
    else:
        print(f"Model file not found: {model_file}")

4.5.2 The Problem of Dead Neurons

Dead neurons (inactive neurons) are neurons that output 0 for every input. This is a particularly important issue for the ReLU family of activation functions. To find dead neurons, pass all of the training data through the network and check which neurons always output 0. This can be done by taking each layer's outputs and combining per-batch all-zero masks with logical operations.

Code
# 3 samples (1 batch), 5 columns (each a neuron's output). Columns 1 and 3 always show 0.
batch_1 = torch.tensor([[0, 1.5, 0, 1, 1],
                        [0, 0,  0, 0, 1],
                        [0, 1,  0, 1.2, 1]])

# Column 3 always shows 0
batch_2 = torch.tensor([[1.1, 1, 0, 1, 1],
                        [1,   0, 0, 0, 1],
                        [0,   1, 0, 1, 1]])

print(batch_1)
print(batch_2)

# Use the .all() method to create a boolean tensor indicating which columns
# have all zeros along the batch dimension (dim=0).
batch_1_all_zeros = (batch_1 == 0).all(dim=0)
batch_2_all_zeros = (batch_2 == 0).all(dim=0)

print(batch_1_all_zeros)
print(batch_2_all_zeros)

# Declare a masked_array that can be compared across the entire batch.
# Initialized to all True.
masked_array = torch.ones(5, dtype=torch.bool)
print(f"masked_array = {masked_array}")

# Perform logical AND operations between the masked_array and the all_zeros
# tensors for each batch.
masked_array = torch.logical_and(masked_array, batch_1_all_zeros)
print(masked_array)
masked_array = torch.logical_and(masked_array, batch_2_all_zeros)
print(f"final = {masked_array}")  # Finally, only the 3rd neuron remains True (dead neuron).
tensor([[0.0000, 1.5000, 0.0000, 1.0000, 1.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 0.0000, 1.2000, 1.0000]])
tensor([[1.1000, 1.0000, 0.0000, 1.0000, 1.0000],
        [1.0000, 0.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 0.0000, 1.0000, 1.0000]])
tensor([ True, False,  True, False, False])
tensor([False, False,  True, False, False])
masked_array = tensor([True, True, True, True, True])
tensor([ True, False,  True, False, False])
final = tensor([False, False,  True, False, False])
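
Scaled up to a real model, the same masking idea can be wrapped in a helper. The following self-contained sketch (a toy nn.Sequential model with random data, not the library's calculate_disabled_neuron implementation) captures each ReLU's outputs with forward hooks and applies the all-zero test:

Code
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 64), nn.ReLU())
data = torch.randn(1024, 784)

# Collect every ReLU's outputs with forward hooks
outputs = {}
for idx, module in enumerate(model):
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(
            lambda m, inp, out, idx=idx: outputs.setdefault(idx, []).append(out))

with torch.no_grad():
    for batch in data.split(128):
        model(batch)

# A neuron is "dead" if its output is 0 for every sample
for idx, outs in outputs.items():
    dead = torch.cat(outs, dim=0).eq(0).all(dim=0)
    print(f"layer {idx}: {int(dead.sum())} / {dead.numel()} dead neurons")
# With random weights and data the counts are typically 0; trained ReLU
# networks, as shown below, can deactivate a noticeable fraction of neurons.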

The function that counts disabled (dead) neurons is calculate_disabled_neuron, located in visualization/training.py. Let's analyze the ratio of disabled neurons in the actual models.

Code
from dldna.chapter_04.visualization.training import calculate_disabled_neuron
from dldna.chapter_04.models.base import SimpleNetwork

# Find in the trained model.
model, _ = load_model(model_file="SimpleNetwork-ReLU.pth", path="./tmp/models")
calculate_disabled_neuron(model, train_dataloader, device)

model, _ = load_model(model_file="SimpleNetwork-Swish.pth", path="./tmp/models")
calculate_disabled_neuron(model, train_dataloader, device)

# Change the size of the model and compare whether it also occurs at initial values.
big_model = SimpleNetwork(act_func=nn.ReLU(), hidden_shape=[2048, 1024, 1024, 512, 512, 256, 128]).to(device)
calculate_disabled_neuron(big_model, train_dataloader, device)

Number of layers to compare = 4
Number of disabled neurons (ReLU) : [0, 6, 13, 5]
Ratio of disabled neurons = 0.0%
Ratio of disabled neurons = 3.1%
Ratio of disabled neurons = 10.2%
Ratio of disabled neurons = 7.8%

Number of layers to compare = 4
Number of disabled neurons (Swish) : [0, 0, 0, 0]
Ratio of disabled neurons = 0.0%
Ratio of disabled neurons = 0.0%
Ratio of disabled neurons = 0.0%
Ratio of disabled neurons = 0.0%

Number of layers to compare = 7
Number of disabled neurons (ReLU) : [0, 0, 6, 15, 113, 102, 58]
Ratio of disabled neurons = 0.0%
Ratio of disabled neurons = 0.0%
Ratio of disabled neurons = 0.6%
Ratio of disabled neurons = 2.9%
Ratio of disabled neurons = 22.1%
Ratio of disabled neurons = 39.8%
Ratio of disabled neurons = 45.3%

According to current research, the severity of the dead-neuron problem varies with the depth and width of the model. Notably:

  1. As the model gets deeper, the proportion of inactive neurons with ReLU increases sharply.
  2. Adaptive activation functions (STAF, TeLU) effectively alleviate this problem.
  3. In Transformer architectures, GELU greatly reduces the dead-neuron problem.
  4. In recent MoE (Mixture of Experts) models, the problem is addressed by using different activation functions for each expert network.

Therefore, when designing neural networks with many layers, alternatives such as GELU, STAF, and TeLU should be considered instead of ReLU; for very large models in particular, the choice must balance computational efficiency against the dead-neuron problem.

4.6 Activation Function Candidates Determination

The selection of activation functions is one of the crucial decision-making factors in neural network design. Activation functions directly influence the network’s ability to learn complex patterns, training speed, and overall performance. The following outlines the latest research findings and best practices organized by application domain.

Computer Vision
  • CNN-based models: ReLU and its variants (LeakyReLU, PReLU, ELU) are still widely used due to their computational efficiency and generally good performance. However, GELU and Swish/SiLU are increasingly used in deeper architectures, especially in high-performance CNNs, because they have smoother gradients.
  • Vision Transformers (ViTs): GELU has become the de facto standard in ViT, consistent with its successful use in transformers for natural language processing.
  • Mobile/Embedded Devices: Hardswish is preferred due to its computational efficiency in resource-constrained environments. ReLU and its variants (e.g., ReLU6 commonly used in MobileNets) remain strong choices.
  • Generative Models (High-Precision Image Generation): STAF has shown promising results but has not been widely adopted yet. Smoother activation functions like Swish, GELU, and Mish are preferred for generation tasks because they tend to produce higher-quality outputs and reduce artifacts. State-of-the-art diffusion models for image generation often use Swish/SiLU.
Natural Language Processing (NLP)
  • Transformer-based Models: GELU is the dominant choice in most transformer architectures (e.g., BERT, GPT).
  • RNN/LSTM: Traditionally, Tanh was preferred, but it’s being gradually replaced by activation functions that better mitigate the vanishing gradient problem. GELU and ReLU variants (with careful initialization and normalization techniques) are frequently used in modern RNN/LSTM implementations.
  • Large Language Models (LLMs): Computational efficiency is paramount. GELU and ReLU (or fast approximations of GELU) are the most common choices. Some LLMs also experiment with special activation functions within Mixture-of-Experts (MoE) layers.
Speech Processing
  • Emotion Recognition: TeLU has shown promise but is not yet a widely used standard. ReLU variants, GELU, and Swish/SiLU are strong and general candidates suitable for a wide range of applications. The optimal choice depends on the specific dataset and model architecture.
  • Speech Synthesis: Smooth activations like Snake and GELU can help produce more natural speech, making them recommended choices.
  • Real-Time Processing: Similar to mobile vision, Hardswish and ReLU variants are suitable for applications requiring low latency.

Practice Problems

4.2.1 Basic Problems

  1. Write the formulas for Sigmoid, Tanh, ReLU, Leaky ReLU, GELU, and Swish functions and draw their graphs using matplotlib or Desmos.

    • Note: Clearly understand the definition and characteristics of each function and compare them visually through graphs.
  2. Find the derivatives (differential) of each activation function and draw their graphs.

    • Note: Derivatives are used to calculate gradients in the backpropagation process. Understand the differentiability and gradient characteristics of each function.
  3. Train a neural network composed only of linear transformations without activation functions using the FashionMNIST dataset, and measure its test accuracy. (Use the SimpleNetwork implemented in Chapter 1)

    • Note: A neural network without activation functions cannot express non-linearity, so it has limitations in solving complex problems. Confirm this through experiments.
  4. Compare the results obtained from problem 3 with those of a neural network using the ReLU activation function and explain the role of activation functions.

    • Note: Compare the output values, gradients, and inactive neurons for each layer with and without activation functions to explain their roles.

4.2.2 Applied Problems

  1. Implement PReLU, TeLU, and STAF activation functions in PyTorch (inherit from nn.Module).

    • Note: Refer to the definition of each function and implement the forward method. If necessary, define learnable parameters using nn.Parameter.
  2. Train a neural network that includes the previously implemented activation functions using the FashionMNIST dataset and compare their test accuracies.

    • Note: Compare the performance of each activation function and analyze which one is more suitable for the FashionMNIST dataset.
  3. For each activation function, visualize the distribution of gradients during training and measure the ratio of “dead neurons”. (Use functions implemented in Chapter 1)

    • Note: Visualize the gradient distribution for each activation function by comparing initial values with trained values and layer-by-layer.
  4. Investigate methods to alleviate the “dead neuron” problem and explain their principles. (Leaky ReLU, PReLU, ELU, SELU, etc.)

    • Note: Explain how each method solves the problems of ReLU and discuss their advantages and disadvantages.

4.2.3 Advanced Problems

  1. Implement the Rational activation function in PyTorch and explain its characteristics and pros and cons.

    • Note: The Rational activation function is based on rational functions (fractional functions) and may show superior performance to other activation functions in certain problems.
  2. Implement B-spline or Fourier-based activation functions in PyTorch and explain their characteristics and pros and cons.

    • Note: B-spline activation functions can express locally controlled flexible curves, while Fourier-based activation functions are useful for modeling periodic patterns.
  3. Propose a new activation function of your own and evaluate its performance compared to existing activation functions (with experimental results and theoretical justification).

    • Note: When designing a new activation function, consider the ideal conditions for an activation function (non-linearity, differentiability, prevention of gradient disappearance/explosion, computational efficiency, etc.).

Exercise Answers

4.2.1 Basic Problems

  1. Formulas and graphs of Sigmoid, Tanh, ReLU, Leaky ReLU, GELU, Swish functions:

    | Activation Function | Formula |
    | --- | --- |
    | Sigmoid | \(\sigma(x) = \frac{1}{1 + e^{-x}}\) |
    | Tanh | \(tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\) |
    | ReLU | \(ReLU(x) = max(0, x)\) |
    | Leaky ReLU | \(LeakyReLU(x) = max(ax, x)\), where \(a\) is a small constant, usually 0.01 (a small slope \(a\) for the \(x < 0\) part of ReLU) |
    | GELU | \(GELU(x) = x\Phi(x)\), where \(\Phi(x)\) is the Gaussian cumulative distribution function |
    | Swish | \(Swish(x) = x \cdot sigmoid(\beta x)\), where \(\beta\) is a constant or a learnable parameter |

    (Graphs can be drawn with matplotlib or Desmos, as the problem requires.)
  2. Derivatives of each activation function:

    | Activation Function | Derivative |
    | --- | --- |
    | Sigmoid | \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) |
    | Tanh | \(tanh'(x) = 1 - tanh^2(x)\) |
    | ReLU | \(ReLU'(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases}\) |
    | Leaky ReLU | \(LeakyReLU'(x) = \begin{cases} a, & x < 0 \\ 1, & x > 0 \end{cases}\) |
    | GELU | \(GELU'(x) = \Phi(x) + x\phi(x)\), where \(\phi(x)\) is the Gaussian probability density function |
    | Swish | \(Swish'(x) = sigmoid(\beta x) + \beta x \cdot sigmoid(\beta x)(1 - sigmoid(\beta x))\) |
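
    A quick autograd check (a small sketch) can confirm an entry, for example the sigmoid derivative:

    import torch

    x = torch.linspace(-4, 4, 9, requires_grad=True)
    s = torch.sigmoid(x)
    s.sum().backward()
    # autograd's gradient matches the analytic formula sigma(x)(1 - sigma(x))
    print(torch.allclose(x.grad, (s * (1 - s)).detach()))  # True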

  3. FashionMNIST, Training and Accuracy Measurement of Neural Network without Activation Function:

    • A neural network without an activation function can only perform linear transformations, so it cannot model complex nonlinear relationships. Therefore, it shows low accuracy in complex datasets like FashionMNIST (around 10% accuracy).
  4. Comparison with ReLU Activation Function and Explanation of its Role:

    • A neural network using the ReLU activation function can achieve much higher accuracy by introducing nonlinearity (over 80% accuracy).
    • Layer-by-Layer Output: Without an activation function, the distribution of layer-by-layer output values shows only simple scale changes, but with ReLU, the distribution changes as negative values are suppressed to 0.
    • Gradient: Without an activation function, the gradient is simply propagated, but with ReLU, the gradient becomes 0 for negative inputs and does not propagate.
    • Dead Neurons: These do not occur when there is no activation function but can occur when using ReLU.
    • Role Summary: The activation function gives nonlinearity to the neural network, allowing it to approximate complex functions, and controls the gradient flow to aid in learning.

4.2.2 Application Problems

  1. PReLU, TeLU, STAF PyTorch Implementation:

    import torch
    import torch.nn as nn
    
    class PReLU(nn.Module):
        def __init__(self, num_parameters=1, init=0.25):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((num_parameters,), init))
    
        def forward(self, x):
            return torch.max(torch.zeros_like(x), x) + self.alpha * torch.min(torch.zeros_like(x), x)

    class TeLU(nn.Module):
        def __init__(self, alpha=1.0):
            super().__init__()
            self.alpha = nn.Parameter(torch.tensor(alpha))

        def forward(self, x):
            return torch.where(x > 0, x, self.alpha * (torch.exp(x) - 1))

    class STAF(nn.Module):
        def __init__(self, tau=25):
            super().__init__()
            self.tau = tau
            self.C = nn.Parameter(torch.randn(tau))
            self.Omega = nn.Parameter(torch.randn(tau))
            self.Phi = nn.Parameter(torch.randn(tau))

        def forward(self, x):
            result = torch.zeros_like(x)
            for i in range(self.tau):
                result += self.C[i] * torch.sin(self.Omega[i] * x + self.Phi[i])
            return result
  2. FashionMNIST, Activation Function Comparison Experiment:

    • Train and compare the test accuracy of neural networks including PReLU, TeLU, and STAF.
    • The experimental results show that adaptive activation functions (PReLU, TeLU, STAF) tend to have higher accuracy than ReLU (in the order of STAF > TeLU > PReLU > ReLU).
  3. Gradient Distribution Visualization, “Dead Neuron” Ratio Measurement:

    • ReLU has a gradient of 0 for negative inputs, while PReLU, TeLU, and STAF propagate small gradient values even for negative inputs.
    • The “dead neuron” ratio is the highest in ReLU and lower in PReLU, TeLU, and STAF.
  4. Methods and Principles to Alleviate the “Dead Neuron” Problem:

    • Leaky ReLU: Allows a small slope for negative inputs to prevent neurons from being completely deactivated.
    • PReLU: Makes the slope of Leaky ReLU a learnable parameter to find the optimal slope based on the data.
    • ELU, SELU: Have non-zero values in the negative region and a smooth curve shape, alleviating the gradient vanishing problem and stabilizing learning.

4.2.3 Advanced Problems

  1. Rational Activation Function PyTorch Implementation, Characteristics, and Advantages/Disadvantages:

    import torch
    import torch.nn as nn
    
    class Rational(nn.Module):
        def __init__(self, numerator_coeffs, denominator_coeffs):
            super().__init__()
            self.numerator_coeffs = nn.Parameter(numerator_coeffs)
            self.denominator_coeffs = nn.Parameter(denominator_coeffs)

        def forward(self, x):
            # PyTorch has no torch.polyval, so evaluate the polynomials
            # explicitly with Horner's method (coefficients in descending order).
            numerator = torch.zeros_like(x)
            for c in self.numerator_coeffs:
                numerator = numerator * x + c
            denominator = torch.zeros_like(x)
            for c in self.denominator_coeffs:
                denominator = denominator * torch.abs(x) + c
            return numerator / (1.0 + denominator)

  • Characteristics: Rational function (fractional function) form. The numerator and denominator are expressed as polynomials.
  • Advantages: Flexible function form. Superior performance to other activation functions in certain problems.
  • Disadvantages: Caution when the denominator is 0. Hyperparameter (polynomial coefficient) tuning required.
  2. B-spline or Fourier-based activation function PyTorch implementation, characteristics, and advantages/disadvantages:

    • B-spline activation function:

      import torch
      import torch.nn as nn
      from scipy.interpolate import BSpline
      import numpy as np

      class BSplineActivation(nn.Module):
          def __init__(self, knots, degree=3):
              super().__init__()
              self.knots = knots
              self.degree = degree
              # scipy's BSpline expects len(knots) = n_coeffs + degree + 1
              self.coeffs = nn.Parameter(torch.randn(len(knots) - degree - 1))  # control points

          def forward(self, x):
              # scipy runs outside the autograd graph, so the spline is evaluated
              # on detached NumPy arrays; gradients flow only through the trainable
              # scale self.coeffs.mean() (a deliberate simplification).
              b = BSpline(self.knots, self.coeffs.detach().cpu().numpy(), self.degree)
              spline_values = torch.tensor(b(x.detach().cpu().numpy()),
                                           dtype=torch.float32, device=x.device)
              return spline_values * self.coeffs.mean()
    • Characteristics: Locally controlled flexible curve. Shape adjusted by knots and degree.

    • Advantages: Smooth function expression. Local feature learning.

    • Disadvantages: Performance affected by knot setting. Increased computational complexity.

  3. Proposal of a new activation function and performance evaluation:

    • (Example) Activation function combining Swish and GELU:
    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    class SwiGELU(nn.Module): # Swish + GELU
      def forward(self, x):
        return 0.5 * (x * torch.sigmoid(x) + F.gelu(x))
    ```
    
    SwiGELU combines the smoothness of Swish and the regularization effect of GELU.
    • Experimental design and performance evaluation: Comparison with existing activation functions on benchmark datasets such as FashionMNIST. (Experimental results omitted)

References

  1. Deep Learning (Goodfellow, Bengio, Courville, 2016): Chapter 6.3 (Activation Functions) https://www.deeplearningbook.org/
    • A textbook that covers comprehensive content about deep learning. It includes basic information about activation functions and other important concepts in deep learning.
  2. Understanding the difficulty of training deep feedforward neural networks (Glorot & Bengio, 2010) http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
    • A paper that analyzes the gradient vanishing problem of Sigmoid and Tanh activation functions and proposes the Xavier initialization method. It is an important resource for understanding the difficulties of training deep neural networks.
  3. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He et al., 2015) https://arxiv.org/abs/1502.01852
    • A paper that proposes the PReLU activation function and the He initialization method. It helps to deepen understanding of the ReLU-family activation functions widely used in modern deep learning.
  4. Searching for Activation Functions (Ramachandran et al., 2017) https://arxiv.org/abs/1710.05941
    • A paper that discovers the Swish activation function through neural architecture search (NAS). It provides ideas for exploring new activation functions.
  5. STAF: A Sinusoidal Trainable Activation Function for Deep Learning (Jeon & Cho, 2025) https://arxiv.org/abs/2405.13607
    • A recent (2025) paper presented at ICLR, which proposes STAF, a Fourier series-based trainable activation function. It helps to understand the latest research trends in adaptive activation functions.